23 Feb 2017

Automatically generated data

With this data we might like to:

  • Look for trends over time
  • Compare different moments in time
  • Do other time series analysis:
      • Look for seasonality
      • Fit ARIMA models
      • Calculate a moving average

Other types of analysis also involve processing timestamp data.

However, our data looks like this:

library(padr)
library(dplyr)
padr::emergency %>% head
## # A tibble: 6 × 6
##        lat       lng   zip                   title          time_stamp
##      <dbl>     <dbl> <int>                   <chr>              <dttm>
## 1 40.29788 -75.58129 19525  EMS: BACK PAINS/INJURY 2015-12-10 17:40:00
## 2 40.25806 -75.26468 19446 EMS: DIABETIC EMERGENCY 2015-12-10 17:40:00
## 3 40.12118 -75.35198 19401     Fire: GAS-ODOR/LEAK 2015-12-10 17:40:00
## 4 40.11615 -75.34351 19401  EMS: CARDIAC EMERGENCY 2015-12-10 17:40:01
## 5 40.25149 -75.60335    NA          EMS: DIZZINESS 2015-12-10 17:40:01
## 6 40.25347 -75.28324 19446        EMS: HEAD INJURY 2015-12-10 17:40:01
## # ... with 1 more variables: twp <chr>

padr helps out with two challenges

Every row is a single observation, typically at the second level, but you want to do analysis at a (much) higher level.

  • padr offers: thicken, used in conjunction with a data manipulation package such as dplyr.
emergency %>% thicken(interval = "month") %>% 
  count(time_stamp_month) %>% head
## # A tibble: 6 × 2
##   time_stamp_month     n
##             <date> <int>
## 1       2015-12-01  7969
## 2       2016-01-01 13205
## 3       2016-02-01 11467
## 4       2016-03-01 11101
## 5       2016-04-01 11326
## 6       2016-05-01 11423

When there is no observation, there is no record.

  • padr offers: pad
data.frame(dt  = as.Date(c("2017-02-23", "2017-02-26")), 
           val = c(2, 4)) %>% pad
##           dt val
## 1 2017-02-23   2
## 2 2017-02-24  NA
## 3 2017-02-25  NA
## 4 2017-02-26   4

The interval

Think of time data as having a heartbeat: it produces data at a certain interval.

padr currently uses eight intervals: year, quarter, month, week, day, hour, minute, and second.

get_interval(emergency$time_stamp)
## [1] "sec"

The interval is the highest (coarsest) of the eight that can explain all the instances observed in the data: every observation must fall on a regular sequence with that step.

dt <- as.Date(c("2017-02-23", "2017-02-26"))
all(dt %in% seq(dt %>% min, dt %>% max, by = 'day'))
## [1] TRUE
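
The rule above can be sketched in base R for Date vectors. This is a minimal illustration, not padr's actual implementation: guess_date_interval is a hypothetical helper, and it only tries the five intervals that seq accepts for dates.

```r
# Hypothetical helper (not part of padr): return the coarsest interval whose
# regular sequence, started at the earliest date, contains every observation.
guess_date_interval <- function(x) {
  x <- sort(unique(x))
  candidates <- c("year", "quarter", "month", "week", "day")  # coarse to fine
  for (by in candidates) {
    spanned <- seq(min(x), max(x), by = by)
    if (all(x %in% spanned)) return(by)
  }
  stop("no interval fits")  # not reached for Dates, because "day" always fits
}

guess_date_interval(as.Date(c("2017-02-23", "2017-02-26")))
## [1] "day"
guess_date_interval(as.Date(c("2017-01-01", "2017-04-01", "2018-01-01")))
## [1] "quarter"
```

Note that a single date trivially fits every interval in this sketch, so it would report year; a real implementation needs to handle that case separately.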

thicken

The thicken function takes a data frame and then:

  • looks for the datetime variable in the data frame.
  • assesses the interval of this variable.
  • spans a variable of a higher interval around it.
  • assigns each original observation to a value in the spanned variable.
  • adds the assignments to the original data frame.
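
These steps can be sketched in base R for the month interval with rounding down. This is only an illustration under simplifying assumptions (the datetime column is known and the target interval is fixed); it is not how padr implements thicken, and the timestamps are made up.

```r
# Three timestamps at second level (made-up values for illustration).
df <- data.frame(
  time_stamp = as.POSIXct(
    c("2015-12-10 17:40:00", "2015-12-31 23:59:59", "2016-01-02 08:15:30"),
    tz = "UTC"
  )
)

# Assign each observation to the first day of its month (rounding down).
df$time_stamp_month <- as.Date(format(df$time_stamp, "%Y-%m-01", tz = "UTC"))

# With the coarser variable in place, aggregate as usual: count rows per month.
month_counts <- aggregate(
  list(n = df$time_stamp),
  by = list(time_stamp_month = df$time_stamp_month),
  FUN = length
)
month_counts
```

The new column sits next to the original one, so the second-level detail is still available after thickening.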

thicken parameters:

x

interval = c("level_up", "year", "quarter", "month", 
    "week", "day", "hour", "min")

colname = NULL

rounding = c("down", "up")

by = NULL

start_val = NULL

pad

The pad function takes a data frame and then:

  • looks for the datetime variable in the data frame.
  • assesses the interval of this variable.
  • spans a variable of the same interval around it.
  • merges the original variable with the spanned variable.
  • leaves NA values for the other variables.
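
The same steps can be sketched in base R with seq and merge. This is a minimal illustration that assumes the interval (day) is already known; it is not padr's own implementation.

```r
df <- data.frame(dt  = as.Date(c("2017-02-23", "2017-02-26")),
                 val = c(2, 4))

# Span a variable of the same interval between the first and last observation.
spanned <- data.frame(dt = seq(min(df$dt), max(df$dt), by = "day"))

# Left join the original data onto the span; unmatched dates get NA for val.
padded <- merge(spanned, df, by = "dt", all.x = TRUE)
padded
##           dt val
## 1 2017-02-23   2
## 2 2017-02-24  NA
## 3 2017-02-25  NA
## 4 2017-02-26   4
```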

pad parameters:

x

interval = NULL

start_val = NULL

end_val = NULL

by = NULL

group = NULL

Version 0.2.0, released last week (followed by patch release v0.2.1), introduced group padding.

emergency %>% 
  thicken('month', colname = "m") %>% 
  count(m, title) %>% 
  pad(group = "title", 
      start_val = as.Date("2015-12-01"),
      end_val   = as.Date("2016-10-01"))
## # A tibble: 1,287 × 3
##             m                title     n
## *      <date>                <chr> <int>
## 1  2015-12-01 EMS: ABDOMINAL PAINS   128
## 2  2016-01-01 EMS: ABDOMINAL PAINS   186
## 3  2016-02-01 EMS: ABDOMINAL PAINS   161
## 4  2016-03-01 EMS: ABDOMINAL PAINS   184
## 5  2016-04-01 EMS: ABDOMINAL PAINS   185
## 6  2016-05-01 EMS: ABDOMINAL PAINS   162
## 7  2016-06-01 EMS: ABDOMINAL PAINS   158
## 8  2016-07-01 EMS: ABDOMINAL PAINS   143
## 9  2016-08-01 EMS: ABDOMINAL PAINS   176
## 10 2016-09-01 EMS: ABDOMINAL PAINS   174
## # ... with 1,277 more rows
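
Conceptually, group padding gives every group the full span of dates. A rough base R sketch of that idea follows; the counts and the group labels A and B are made up for illustration, and this is not padr's implementation.

```r
# Made-up monthly counts for two groups, each missing at least one month.
counts <- data.frame(
  m     = as.Date(c("2016-01-01", "2016-03-01", "2016-02-01")),
  title = c("A", "A", "B"),
  n     = c(5L, 7L, 3L),
  stringsAsFactors = FALSE
)

# The span that every group should cover.
spanned <- seq(as.Date("2016-01-01"), as.Date("2016-03-01"), by = "month")
groups  <- unique(counts$title)

# One row per (month, group) combination.
grid <- data.frame(
  m     = rep(spanned, times = length(groups)),
  title = rep(groups, each = length(spanned)),
  stringsAsFactors = FALSE
)

# Left join the observed counts; combinations without a record get NA.
padded <- merge(grid, counts, by = c("m", "title"), all.x = TRUE)
padded <- padded[order(padded$title, padded$m), ]
padded
```

Each group ends up with three rows, one per month, and n is NA wherever the group had no record.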

Fill the missings

After padding, the inserted records have missing values for all the other variables.

padded_df <- 
  data.frame(dt  = as.Date(c("2017-02-23", "2017-02-25", 
                             "2017-02-27")), 
           val = c(2, 4, 2)) %>% pad

padded_df
##           dt val
## 1 2017-02-23   2
## 2 2017-02-24  NA
## 3 2017-02-25   4
## 4 2017-02-26  NA
## 5 2017-02-27   2

Depending on the nature of the data you might want to:

Carry the last value forward

padded_df %>% 
  tidyr::fill(val)
##           dt val
## 1 2017-02-23   2
## 2 2017-02-24   2
## 3 2017-02-25   4
## 4 2017-02-26   4
## 5 2017-02-27   2
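
For readers who want to see what carrying forward does, here is a small base R sketch of the same idea. carry_forward is a hypothetical helper written for this illustration; tidyr::fill remains the convenient choice on data frames.

```r
# Hypothetical helper: replace each NA with the most recent non-missing value.
carry_forward <- function(v) {
  # Position of the latest non-missing value seen so far (0 if none yet).
  last_seen <- cummax(seq_along(v) * (!is.na(v)))
  # Leading NAs have nothing to carry forward, so they stay NA.
  last_seen[last_seen == 0] <- NA
  v[last_seen]
}

carry_forward(c(2, NA, 4, NA, 2))
## [1] 2 2 4 4 2
```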

Fill all the missings with the same value

padded_df %>% 
  fill_by_value(val, value = 42)
##           dt val
## 1 2017-02-23   2
## 2 2017-02-24  42
## 3 2017-02-25   4
## 4 2017-02-26  42
## 5 2017-02-27   2

Fill all the missings with a function of the nonmissings

padded_df %>% 
  fill_by_function(val, fun = mean)
##           dt      val
## 1 2017-02-23 2.000000
## 2 2017-02-24 2.666667
## 3 2017-02-25 4.000000
## 4 2017-02-26 2.666667
## 5 2017-02-27 2.000000

Fill all the missings with the most prevalent of the nonmissings

padded_df %>% 
  fill_by_prevalent(val)
##           dt val
## 1 2017-02-23   2
## 2 2017-02-24   2
## 3 2017-02-25   4
## 4 2017-02-26   2
## 5 2017-02-27   2
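
The three fill_by_ strategies can be mimicked on a plain vector in base R, which may help to see what each one computes. This is a sketch of the ideas only, not padr's code; in particular, the tie handling here (take the first most frequent value) is an assumption of this example.

```r
val <- c(2, NA, 4, NA, 2)

# Same value for every missing entry.
by_value <- replace(val, is.na(val), 42)

# A function of the non-missing values, here the mean.
by_mean <- replace(val, is.na(val), mean(val, na.rm = TRUE))

# The most prevalent non-missing value; table() drops NA by default.
most_prevalent <- as.numeric(names(which.max(table(val))))
by_prevalent   <- replace(val, is.na(val), most_prevalent)

by_value
## [1]  2 42  4 42  2
by_prevalent
## [1] 2 2 4 2 2
```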

Final example to wrap up

library(ggplot2)
animal_bites_plot <- 
  emergency %>% 
  filter(title == 'EMS: ANIMAL BITE') %>% 
  thicken(interval = 'day', colname = 'ts_day') %>% 
  count(ts_day) %>% 
  pad() %>% 
  fill_by_value(n) %>% 
  ggplot(aes(ts_day, n)) +
  geom_point() +
  geom_line() +
  geom_smooth()

animal_bites_plot

Future plans

Spanning, rather than altering dates, is very powerful, but its full potential is not used yet.

The plan is to enable the user to apply a custom span, since seq is very flexible.

seq(as.Date('2017-02-23'), as.Date('2017-03-03'), by = "3 days")
## [1] "2017-02-23" "2017-02-26" "2017-03-01"

I still need to figure out how to fit this in neatly with the interval paradigm.
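
As a rough illustration of the idea, each observation could be assigned to the closest preceding value of a custom span. This is a sketch of one possible approach in base R, not an existing or planned padr feature; the observation dates are made up.

```r
# Custom span with a step of three days.
spanned <- seq(as.Date('2017-02-23'), as.Date('2017-03-03'), by = "3 days")

# Made-up observations that fall on or between the span values.
obs <- as.Date(c("2017-02-23", "2017-02-25", "2017-02-26", "2017-03-02"))

# findInterval gives, for each observation, the index of the last span value
# that is less than or equal to it, which amounts to rounding down.
assigned <- spanned[findInterval(as.numeric(obs), as.numeric(spanned))]
assigned
## [1] "2017-02-23" "2017-02-23" "2017-02-26" "2017-03-01"
```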

More information

There are two vignettes: a general introduction and a more detailed look at the implementation.

vignette("padr")
vignette("padr_implementation")

I blog about changes in padr on: thats-so-random.com

And the package is maintained on: github.com/EdwinTh/padr